update proposal for maxUnavailable for statefulsets #1010
/assign @bgrant0607 @kow3ns @janetkuo @Kargakis
Great, the description of choices is very clear. Also, I still recommend choice 3; it keeps a balance between faster updating and the ordering guarantee. The problem with choice 1 is that it looks simple, but actually is not. Let me take an example:
Is this even possible? The replicas cannot be scaled to 5 before the update finishes, I believe. The update knows to start from the highest ordinal and count maxUnavailable pods at a time. Note the goal is faster rolling updates. In faster rolling updates, we could potentially provide two modes, one with ordering and one which says ordering in the maxUnavailable group is not guaranteed. I don't think we have enough data to say we need the modes. We can start with an option which makes the most sense for faster updates. The consumer of faster updates would know the ordering is not guaranteed when using this feature, and that is a conscious choice the user will make.
Actually, it can. You may test this case in your Kubernetes cluster or in minikube, for example. So as I said before, a controller with choice 1 will be confused when the user modifies replicas during the update.
with ordinal 2 will start Terminating. Pods with ordinal 0 and 1 will remain untouched due to the partition. In this choice, the number of pods
terminating is not always maxUnavailable, but sometimes less than that. E.g. if the pod with ordinal 3 is running and ready but 4 is not, we
still wait for 4 to be running and ready before moving on to 2. This implementation avoids out-of-order Terminations of pods.
2: Pods with ordinal 4 and 3 will start Terminating at the same time (because of maxUnavailable). When any of 4 or 3 are running and ready, pods
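The difference between the two semantics above can be sketched as a small simulation. This is an illustrative model only, with hypothetical function names; it is not the actual StatefulSet controller logic:

```python
def next_terminations_choice1(replicas, partition, max_unavailable, ready_updated):
    """Choice 1: proceed in fixed windows of size max_unavailable from the
    highest ordinal down; a lower window opens only once every pod in the
    current window is updated and ready (so if 3 is ready but 4 is not,
    we still wait for 4 before touching 2)."""
    ordinals = list(range(replicas - 1, partition - 1, -1))
    for start in range(0, len(ordinals), max_unavailable):
        window = ordinals[start:start + max_unavailable]
        pending = [o for o in window if o not in ready_updated]
        if pending:
            return pending  # only these pods may be terminating right now
    return []

def next_terminations_choice2(replicas, partition, max_unavailable, ready_updated):
    """Choice 2: at most max_unavailable un-updated pods are down at once,
    but the window slides as soon as any pod becomes updated and ready."""
    pending = [o for o in range(replicas - 1, partition - 1, -1)
               if o not in ready_updated]
    return pending[:max_unavailable]

# replicas=5, partition=2, maxUnavailable=2, and only pod 3 updated+ready:
print(next_terminations_choice1(5, 2, 2, {3}))  # [4]    -> still waiting on 4
print(next_terminations_choice2(5, 2, 2, {3}))  # [4, 2] -> 2 starts early
```

Under choice 1 the controller keeps waiting on pod 4 even though 3 is ready; under choice 2, pod 2 begins terminating as soon as 3 becomes ready, trading ordering for speed.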
The three alternatives are getting merged into one paragraph; needs an extra newline between each alternative.
add a newline
#### Implementation

TBD: Will be updated after we have agreed on the semantics beign discussed above.
being
I recommend Choice 1, simply because it's easy to think and reason about.

Other choices introduce a flag, which controls whether we want maxUnavailable with ordering or without ordering. I don't think we need that
List this as Choice 4?
We can specify in documentation what the behavior will be. Note that changing the grouping is not a big deal. The main goal is faster updates; whether the pod gets updated in one grouping or another doesn't matter, unless you have a good reason it matters.
Adding more people to see if they have opinions on the choices presented here
about. They are used by DaemonSet and StatefulSet for tracking revisions. It would be so much nicer if all the use cases of Deployments could be met and we
could track the revisions by ControllerRevisions.
4. Sometimes I just want easier tracking of revisions of a rolling update. Deployment does it through ReplicaSets and has its own nuances. Understanding
that requires diving into the complexity of hashing and how ReplicaSets are named. Over and above that, there are some issues with hash collisions which
I feel like this point should be made clear in feedback to @kubernetes/sig-apps-api-reviews - are there any supporting statements we should have here about changes that should happen in deployments?
that requires diving into the complexity of hashing and how ReplicaSets are named. Over and above that, there are some issues with hash collisions which
further complicate the situation (I know they were solved). StatefulSet introduced ControllerRevisions in 1.7 which are much easier to think and reason
about. They are used by DaemonSet and StatefulSet for tracking revisions. It would be so much nicer if all the use cases of Deployments could be met in
StatefulSets and additionally we could track the revisions by ControllerRevisions.
Hrm - that sounds like a dangerous statement. Are you saying stateless deployments should be implemented by StatefulSets?
@smarterclayton I will reword this. What I am trying to say is that sometimes people just want this little extra thing which StatefulSets provide, which is stable identity, without the storage part, but everything else is exactly similar to Deployments. They stick with Deployments because StatefulSet doesn't have maxUnavailable. If StatefulSet had maxUnavailable, they could move to it and additionally take advantage of better tracking of versions.
In general I do feel that the distinction between StatefulSet and Deployment is blurring. The only advantage at this point I can think of for using Deployments vs StatefulSets is that Deployments can maxSurge but StatefulSets cannot, because of their identity. But that is a different discussion of making them one object.
### Goals
StatefulSet RollingUpdate strategy will contain an additional parameter called `maxUnavailable` to control how many Pods will be brought down at a time,
during Rolling Update.
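As a sketch of what this goal describes, the proposed field would sit alongside `partition` in the RollingUpdate strategy. The field placement below is illustrative of the proposal, not a final API:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 5
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
      maxUnavailable: 2   # proposed field: allow up to 2 Pods down at a time
```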
### Non-Goals
maxUnavailable is only implemented to affect the Rolling Update of StatefulSet. Considering maxUnavailable for Pod Management Policy of Parallel is beyond
Why?
@smarterclayton because podManagementPolicy is only applicable during StatefulSet creation, deletion and scaling. The main use case is faster rolling updates. We could make it applicable to create/delete/scale, but that is secondary to the goal. I thought we could just limit the scope of the initial implementation to make it less confusing, but if you think we should apply this to both, that is fine as well.
So during upgrade, if 3 of 5 nodes are evacuated, PMP is ignored? Since this is mostly established behavior of ReplicaSets and Deployments, you should mention why it's out of scope. I am more concerned with creating something that differs between RS/Deployment and StatefulSet right now than I am about implementation ordering, so I just want the KEP to reflect that it is a) consistent with user expectations and b) justifies statements about timing with reasons.
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
this time, only 1 is unavailable although we requested 2.
are the possible behavior choices we have:

1. Pods with ordinal 4 and 3 will start Terminating at the same time (because of maxUnavailable). Once they are both running and ready, pods
@smarterclayton do you have any suggestions/opinions on which of these semantics makes more sense when implementing maxUnavailable ?
@smarterclayton @kow3ns based on discussion in the sig-apps meeting and offline discussion with Clayton, I have updated the proposal. Please take a look and provide feedback.
##### Recommended Choice
I recommend Choice 4, using PMP=Parallel for the first Alpha Phase. This would give the users fast
In particular, I think users will want the behavior of (1) above if they are using Ordered pod management and the behavior of (2) if they are using Parallel. Consider implementing just these semantics. That is, if PMP is Ordered, the user is declaring that they care about termination ordering, even if they decide they are willing to tolerate a larger number of disruptions during an update. If the PMP is Parallel, the user does not care about the termination ordering during turn-up/turn-down. If they do care about the termination ordering during update, they can set the maxUnavailable field to 1 to preserve the current behavior. If they wish to tolerate a larger number of disruptions, they can increase its value.
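The mapping suggested here, keying termination ordering off podManagementPolicy, can be sketched as a tiny dispatch. The function name and return strings are hypothetical, purely for illustration:

```python
def termination_ordering(pod_management_policy, max_unavailable):
    """Sketch of the suggested semantics: OrderedReady keeps strict
    termination ordering within the maxUnavailable window; Parallel
    relaxes it. maxUnavailable=1 preserves today's one-at-a-time
    behavior under either policy."""
    if max_unavailable < 1:
        raise ValueError("maxUnavailable must be at least 1")
    if max_unavailable == 1:
        return "one pod at a time (current behavior)"
    if pod_management_policy == "OrderedReady":
        return "ordered window of %d" % max_unavailable
    if pod_management_policy == "Parallel":
        return "unordered window of %d" % max_unavailable
    raise ValueError("unknown podManagementPolicy: %r" % pod_management_policy)
```

This keeps a single knob (maxUnavailable) while letting the existing podManagementPolicy field express whether the user cares about ordering, rather than adding a separate mode flag.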
@kow3ns can you approve the changes to mark it
For reference: the Kruise third-party project provides an advanced StatefulSet implementation with
I'm glad to see OpenKruise being noted in this KEP. @praseodym Kruise is a set of controllers which extend and complement Kubernetes core controllers for workload management, and the AdvancedStatefulSet provides features like maxUnavailable, inPlaceUpdate and so on.
/approve |
@krmayankk: you cannot LGTM your own PR.
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kow3ns, krmayankk
I'm generally not in favor of both approving and lgtm'ing something. However, we've discussed this in the SIG multiple times over a period of more than a month.
Thanks @kow3ns. If there are any follow-up comments or suggestions from other reviewers, I am happy to make follow-up PRs while I make progress on the implementation.
Kindly ping @krmayankk. Are we going to graduate this feature to beta in v1.26, or do we have any plans?
/sig apps